An Experiment in Hybrid Dictionary and Statistical Sentence Alignment
Abstract
The task of aligning sentences in parallel corpora of two languages has been well studied using pure statistical or linguistic models. We developed a linguistic method based on lexical matching with a bilingual dictionary and two statistical methods based on sentence length ratios and sentence offset probabilities. This paper seeks to further our knowledge of the alignment task by comparing the performance of the alignment models when used separately and together, i.e. as a hybrid system. Our results show that for our English-Japanese corpus of newspaper articles, the hybrid system using lexical matching and sentence length ratios outperforms the pure methods.
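As a rough illustration of how lexical matching and sentence length ratios might be combined, the sketch below scores candidate sentence pairs with a weighted sum of the two signals. It is not the system described in the abstract: the function names, the Gaussian-shaped length score, the `weight` and `sigma` parameters, the toy dictionary, and the greedy best-match selection are all assumptions made for illustration.

```python
import math


def length_score(src_len, tgt_len, expected_ratio=1.0, sigma=0.5):
    """Score in [0, 1] that peaks when tgt_len / src_len equals expected_ratio.

    The log-ratio Gaussian shape and sigma are illustrative assumptions.
    """
    if src_len == 0 or tgt_len == 0:
        return 0.0
    deviation = math.log((tgt_len / src_len) / expected_ratio)
    return math.exp(-(deviation ** 2) / (2 * sigma ** 2))


def lexical_score(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens with a dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if tgt_set & dictionary.get(tok, set()))
    return hits / len(src_tokens)


def hybrid_score(src_tokens, tgt_tokens, dictionary, weight=0.5):
    """Weighted sum of the lexical and length scores (weight is hypothetical)."""
    lex = lexical_score(src_tokens, tgt_tokens, dictionary)
    length = length_score(len(src_tokens), len(tgt_tokens))
    return weight * lex + (1 - weight) * length


def align(src_sents, tgt_sents, dictionary, weight=0.5):
    """Greedily pair each source sentence with its best-scoring target sentence.

    Returns a list of (source_index, target_index) pairs. A real aligner
    would usually also enforce ordering constraints; this sketch does not.
    """
    pairs = []
    for i, src in enumerate(src_sents):
        scores = [hybrid_score(src, tgt, dictionary, weight) for tgt in tgt_sents]
        pairs.append((i, max(range(len(tgt_sents)), key=scores.__getitem__)))
    return pairs


# Toy English-Japanese data, invented for this example.
toy_dictionary = {
    "cat": {"猫"}, "sleeps": {"寝る"},
    "dog": {"犬"}, "runs": {"走る"}, "fast": {"速く"},
}
english = [["the", "cat", "sleeps"], ["the", "dog", "runs", "fast"]]
japanese = [["犬", "は", "速く", "走る"], ["猫", "は", "寝る"]]
pairs = align(english, japanese, toy_dictionary)
```

Setting `weight` to 1.0 or 0.0 reduces the score to the purely lexical or purely length-based signal, which is one simple way to compare the separate methods against the combination on the same data.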
Related Papers
A Hybrid Approach to Align Sentences and Words in English-Hindi Parallel Corpora
In this paper we describe an alignment system that aligns English-Hindi texts at the sentence and word level in parallel corpora. We describe a simple sentence length approach to sentence alignment and a hybrid, multi-feature approach to perform word alignment. We use regression techniques in order to learn parameters which characterise the relationship between the lengths of two sentences in p...
Sentence Alignment of Historical Classics based on Mode Prediction and Term Translation Pairs
Parallel corpora are essential resources for constructing bilingual term dictionaries of historical classics. To obtain large-scale parallel corpora, this paper proposes a sentence alignment method based on mode prediction and term translation pairs. On one hand, the method rebuilds the sentence alignment process according to characteristics of the translation of historical classics, and a...
MathLingBudapest: Concept Networks for Semantic Similarity
We present our approach to measuring semantic similarity of sentence pairs used in Semeval 2015 tasks 1 and 2. We adopt the sentence alignment framework of (Han et al., 2013) and experiment with several measures of word similarity. We hybridize the common vector-based models with definition graphs from the 4lang concept dictionary and devise a measure of graph similarity that yields good result...
Log-Linear Models for Word Alignment
We present a framework for word alignment based on log-linear models. All knowledge sources are treated as feature functions, which depend on the source language sentence, the target language sentence and possible additional variables. Log-linear models allow statistical alignment models to be easily extended by incorporating syntactic information. In this paper, we use IBM Model 3 alignment pr...
A New Combined Lexical and Statistical based Sentence Level Alignment Algorithm for Parallel Texts
Parallel text alignment is an active research area in the field of Natural Language Processing. In this paper, we propose a method for sentence alignment of parallel texts that is based on both lexical and statistical information. The alignment procedure uses a dynamic programming technique. We ran our experiments on Spanish and English texts. We use lexical information from bilingual Spanish-English...